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PARALLELIZATION OF THE PIPELINED THOMAS ALGORITHM 

A. POVTTSKY * 


Abstract. In this study the following questions are addressed. Is it possible to improve the 
parallelization efficiency of the Thomas algorithm? How should the Thomas algorithm be formulated 
in order to get solved lines that are used as data for other computational tasks while processors arc 
idle? 

To answer these questions, two-step pipelined algorithms (PAs) arc introduced formally. It is shown 
that the idle processor time is invariant with respect to the order of backward and forward steps in PAs 
starting from one outermost processor. The advantage of PAs starting from two outermost processors 
is small. Versions of the pipelined Thomas algorithms considered here fall into the category of PAs. 

These results show that the parallelization efficiency of the Thomas algorithm cannot be improved 
directly. However, the processor idle time can be used if some data has been computed by the time 
processors become idle. To achieve this goal the Immediate Backward pipelined Thomas Algorithm 
(IB-PTA) is developed in this article. The backward step is computed immediately after the forward 
step has been completed for the first portion of lines. This enables the completion of the Thomas 
algorithm for some of these lines before processors become idle. An algorithm for generating a static 
processor schedule recursively is developed. This schedule is used to switch between forward and 
backward computations and to control communications between processors. The advantage of the IB- 
PTA over the basic PTA is the presence of solved lines, which are available for other computations, 
by the time processors become idle. 

Key words. Thomas algorithm, band matrix, pipelined algorithms, parallel computations, im- 
plicit numerical methods, MIMD computer 

Subject classification. Computer Science 

1. Introduction. Implicit methods arc widely used in computational mechanics. Usually, the 
systems of linearized and discretized equations obtained arc of a bandwidth type. 

A very efficient direct solver, known as the Thomas algorithm, is used for solution of these systems 
of equations in computational mechanics [l]. The Thomas algorithm represents a version of the Gauss 
Elimination method for band matrix systems. For multi-dimensional cases, the Alternative Direction 
Implicit (ADI) methods are based on solution of linear systems with band matrices corresponding to 
finite-difference discretization schemes in each direction. 

The usual way of parallelizing numerical schemes for MIMD computers is to partition the com- 
putational field into multiple subdomains, then to allocate each subdomain to a different processor. 
This is done so as to minimize communication, delay and processor idle time. 

Parallelization of implicit solvers that use the Thomas algorithm for the solution of large banded 
linear systems of equations is hindered by global spatial data dependencies. Parallel versions of the 
Thomas algorithm are of the pipelined type. A pipeline in a parallel program involves each processor 
performing the same set of operations on a successive (continuous) stream of data. Pipelines occur 
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supported by the National Aeronautic and Space Administration under NASA Contract No. NAS 1-97046 while the 
author was in residence at the Institute for Computer Applications in Science and Engineering (ICASE), NASA Langley 
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due to the recurrence of data within a loop [2]. The main disadvantage is that during the pipelined 
process some processors will be idle when we switch from the forward to the backward steps of the 
Thomas algorithm. 

In order to avoid pipelining, some parallel Thomas algorithms include the reduction of a banded 
system of equations on P “slave” processors, the solution of the reduced system of size O(P) on the 
“master” processor, the broadcast of this solution to the “slaves”, and simultaneous computation of 
the final solution on P processors. This algorithm includes global communications of “slave- master” 
type and additional computations on each “slave” processor. F. Gustavson and A. Gupta [3] recently 
have proposed such an improved parallel algorithm for tri-diagonal systems. This algorithm has a 
redundancy of two: it requires twice as many computational operations per grid point as the serial 
Thomas algorithm. 

Implementation of internal boundary conditions eliminates far-field data dependencies, allowing 
band matrix systems to be solved independently on each processor. However, modification of either the 
finite- difference approximation or the implicitness of the scheme due to interface boundary conditions 
can cause accuracy, stability and convergence properties to deteriorate relative to the original serial 
method [4]. 

Naik et al. [5] reduced the parallelization penalty of the solution of tri-diagonal systems by sending 
necessary information to neighboring processors for groups of rendered lines at forward and backward 
steps of the Thomas algorithm. The authors of [5] derived the optimal number of lines to be solved 
per message as a function of computation time per grid point and communication time. This pipelined 
Thomas algorithm (PTA) does not need global communication and performs the same computations 
as the serial Thomas algorithm; however, at two points in this algorithm each processor has to wait 
for all processors ahead. 

There is no available systematic study about various formulations of the pipelined Thomas al- 
gorithms, therefore we begin with theoretical proofs about processor idle time for these algorithms, 
using the unified approach of two-step pipelined algorithms. 

In this paper a new pipelined Thomas algorithm is developed. This algorithm is designed for 
parallel multi-domain solution of band matrix linear systems. We call it the Immediate Backward 
pipelined Thomas Algorithm (IB-PTA): the backward step is performed line-by-line immediately after 
the forward step has been completed for these lines. This algorithm provides exactly the same solution 
as the serial Thomas algorithm. 

The advantage of the IB-PTA over the basic PTA is that some lines have been completed by 
the backward step of the Thomas algorithm before the processors are idle. In non-linear and multi- 
dimensional problems the IB-PTA may be used for other computational tasks at the time w r hen 
processors are idle. These tasks include computation of the Jacobians of the original non-linear 
systems, computation of right hand sides of ADI equations and computation by the Thomas algorithm 
in the next spatial direction. Use of the IB-PTA should lead to a considerable reduction of idle 
processor time. 

Practical use of the proposed IB-PTA is based on a static processor schedule, which has been 
created before computations by the Thomas algorithm are executed. The rest of the article describes 
the recursive algorithm for the assignment of this schedule. 

Implementation of the IB-PTA within a 3-D PDE solver is the scope of another study [6]. 

The article contains four sections. In section 2, various formulations of the Thomas pipelined 
algorithms are discussed. In section 3, a formal definition and study of two-step pipelined algorithms 
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(PA) is presented. In section 4, the procedure to generate the computational and communicational 
schedule of processors is presented. 

2. The pipelined Thomas algorithms. The theory of pipelined Thomas algorithms developed 
in this paper can be applied to various types of matrices, including block bandwidth and n-width band 
matrices. In numerical analysis we usually have to solve the system of N x L equations which is split 
to L banded systems of N equations. Each banded system corresponds to one line of numerical grid, 
and commonly N / L. The set of L systems where each system is a scalar tridiagonal matrix N x N 
is taken here as an example: 

(1) akjXk-i f i + b k ,ix k,i + CkjXk+ij = fk t u k = 1> —j N, l = 1, ..., L, 

where a^i are the coefficients, arc the unknown variables, and L is the number of lines. 

The version of the Thomas algorithm for the scalar tridiagonal case is as follows. The forward (the 
first) step of the serial Thomas algorithm is 

d\ y i bk,i Q'kii 3 ^ 2, •••> AT, 



For pipelined Thomas parallel algorithms considered in this article, the coefficients of Eq. (1) are 
mapped onto P processors so that the subset 

{ak,h bkj) c k,h | k = N(p — l)/P + 1, ...» Np/P t l = 1, ..., L} belongs to the p th processor. 

We now briefly describe the pipelined Thomas Algorithm (PTA) [5]. Rendering the I th fine, 
the p th processor receives coefficients div(p-i)/p,/i 9 n ( p - i )/ p,i from the (p — l) </l processor; computes 
the forward step coefficients dkj and gkj, where k = N(p — 1)/P T l,...,ATp/P, sends coefficients 
dNp/P,h 9Np/P,i the (p -f 1)^* processor and repeats computations (2) for the next lines until all 
the forward step computations are completed. The order of computation of lines by the forward and 
backward steps of the PTA is shown in Fig. la. The three upper lines in Fig. la arc computed 
by the forward step computations. After completion of all forward step computations specific to a 
single processor, the p th processor (except last) has to wait for the completion of the forward step 
computations by all processors ahead of it. The last outermost ( P th ) processor starts the backward 
step computations (3) first. Other processors proceed with the backward step computations in the 
similar manner as the forward step computations. The three lower lines in Fig. la represent the 
backward step computations. The coefficients d and g of the forward step at interface boundaries are 
required for beginning of computations in the (p + l) th processor; similarly the computed x values of 
the backward step are required for the beginning of computations in the (p — l) th processor. 

Processors are idle between the forward and backward steps (see Fig 3a), and there is no data 
available for other computational tasks by this time. 

The algorithm denoted as the Immediate Backward PTA (IB-PTA) is shown in Fig. lb. It is 
assumed that the computational time per part of a line, corresponding to the steps performed by a 
single processor, is equal to unity for both forward and backward steps. Actually these times are 
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different for forward and backward steps of Thomas algorithm. This issue will be discussed in the 
next sections. First, lines arc computed by the forward step until the first line is completed on the last 
processor (the two upper lines in Fig. lb). Then the backward step computations for each line start 
immediately after the completion of the forward step on the last processor (the next four lines in Fig. 
lb). Each processor switches between the forward and backward steps of the Thomas algorithm as 
shown in Fig. 2a at two sequential time units. Each processor communicates with its neighbors to get 
necessary data for beginning of cither the forward or backward computations for next line. Finally, 
remaining lines are computed by the backward step computations and there arc no available lines for 
the forward step computations (the two lower lines in Fig. lb). 

Two other versions of the pipelined Thomas algorithms are shown in Fig. lc-d. These formulations 
of the PTA, denoted as the first-last pipelined Thomas algorithms (FL-PTAs), start the forward step 
computations from the two outermost processors (indeed, each line starts from one side). Each 
processor switches between pipelined computations in increasing (from the first to the P th processor) 
and decreasing directions. Four upper lines in Fig. lc are computed by forward step computations. 
Outermost processors start their forward step computations simultaneously, and middle processors 
arc idle waiting for available data. Then outermost processors start the backward computations 
simultaneously and perform it in the same way as the forward step computations (the four lower lines 
in Fig. lc). Combination of the FL-PTA and the IB-PTA leads to the First-Last Immediate Backward 
Pipelined Thomas algorithm (FL- IB-PTA), where four fluxes of data are treated simultaneously (Fig. 
Id). Outermost processors start the forward step computations simultaneously as in the case of the 
FL-PTA. The backward step computations for each line start immediately after completion of this 
line by the forward step computations. 

Now we can calculate the processor idle time for the basic PTA and the IB-PTA. The idle time 
for the p th processor is time between the start of the forward step computations and completion of 
the backward step computations on the p th processor, when this processor is idle. The maximum 
processor idle time for the system of P processors is equal to the maximum of the idle times of 
individual processors. 

In the basic PTA, the p ih processor waits for the completion of the forward step computations for 
the last line by P — p processors ahead of it, and for the completion of the backward step computations 
for the first line by the same P — p processors. Thus, the processor idle time is equal to 2(P — p) for 
the p th processor (see Fig 3a). 

Performing the IB-PTA, the p th processor has computed 2(P — p) -f 1 lines by the forward step 
before it starts the backward step computations for the first line. At that point, each processor spends 
one half of its time computing new lines by the forward step and the other half computing the previous 
lines by the backward step. Once all lines are completed by the forward step, processors are busy 
only one half of the time executing the backward step computations (sec Fig. 3b). At this time there 
are 2(P — p) + 1 lines uncompleted by the backward step. Thus, the idle time of the p th processor 
2(P — p) is the same as for the PTA. 

Obviously, there are a lot of possible versions of PTAs. In order to predict the processor idle time 
for any PTA, a unified approach is developed in the following section. 

3. Theoretical estimation of processor idle time for the PTA. 

3.1. Two-step pipelined algorithms. Suppose there arc L lines of data. Each fine is split 
equally between P processors and is treated as follows: The first (forward) step computations start 
from the first outermost processor and run in a pipelined way to the last outermost processor. The 
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second (backward) step for any line is performed from the P th to the first processor after the first 
step has been completed for this fine. 

This algorithm is fully characterized by the I\ (p, l ) and h{p, 0 functions that are the times of the 
beginning of the first and second step computations, respectively, for the I th line on the p th processor. 
The computational time per fraction of a line assigned to a single processor is equal to unity for both 
(forward and backward) steps on a single processor. 

To make the algorithm valid, the following relations must be satisfied for the I i and I 2 functions: 


(4) ± h(p,j)yr,s e {1,2}, i^j; 

(5) 1,*) + 1, P= 2, 

(6) I 2 (P,i)>h(P,i) + h 

(7) h{p,i) > h{p + M) + 1, p-P- 1, 1, 


where i,j = 1, ..., L are line numbers, p — 1, ..., P are processor numbers, and r € {1, 2} and s € {1,2} 
are indices specifying either the Ii or I 2 function. 

Condition (4) states that each processor computes one step (forward or backward) for a single 
line per time unit. The pipelined nature of the algorithms considered here leads to inequalities (5, 
7) for the first and second steps, respectively. Inequality (6) corresponds to the condition that the 
second step starts after completion of the first step for each line. 

Additionally, if there are some available lines that have been completed by a previous processor 
or that arc ready to start from the outermost processor, the line with minimum sequential number 
is computed first. Thus, the outermost processors execute the forward and backward steps for lines 
in increasing order. This requirement, denoted as A-rcquirement, is made for convenience in the 
following discussion. 

The maximum elapsed time is defined for a system of processors as the time interval between the 
beginning of computation of the first line and the completion of computation of the last line. For the 
PA, the elapsed time is equal to 


(8) T e = 7 2 (l,L)-/ 1 (l,l) + l. 

The useful time of each processor is equal to 2 L, as computational time is equal to unity per 
fraction of line assigned to a single processor. The maximum processor idle time for the system of 
processors is defined as the maximum idle time over all processors performing the PA, and is equal to 
T e - 2 L. 

The following theorem is valid for the PA. 

Theorem 3.1. For the PA, the maximum elapsed time is greater than or equal to 2 L + 2 (P — 1). 
There exists functions I\ and I 2 satisfying conditions (4~V such that T e = 2L + 2(P — 1) for these I\ 
and 12 - 

Proof The proof of the first part of the theorem is done by the method of mathematical induction 
on the number of processors P, 

For P = 1 (a single processor is involved) the elapsed time is equal to 2 L for any number of lines 
L. 
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Suppose for p < P the theorem is satisfied. Let the (P ■ f l) t/l processor render the same L lines. 
Assume, to show a contradiction, that for P -f 1 processors the elapsed time is 

(9) Te(P+ 1)<2L + 2((P+1)-1) 
for some particular Ii and 7 2 satisfying (4-7). 

Now switch off the first processor and consider the system of P processors 2, P + 1 with Pf ew 
and I£ ew as follows: 

(10) I?™(p,i) = I r (p+l,i), p=l,...,P, re {1,2} 

Note that the functions I\ (p, l) and 7 2 (p, /), where p = 2, P+ 1, remain the same as for the assumed 
system of P+ 1 processors. This set (I™ €W , I% ew ) satisfies (4-7). The elapsed time for this system of 
P processors is equal to 

(11) T e (P) = ir w (l,L) - ir w ( 1 , 1) + 1 = 7 2 (2,L) - 7i(2, 1) + 1 . 

According to conditions (5,7) 

(12) h{ 2, L) < 7 2 (1, L) - 1, 7i(2, 1) > 7 1 (1, 1) + 1. 

Therefore, 

(13) T e (P) < T e (P + 1) - 2 < 2L + 2(P - 

This contradiction of the induction hypothesis ( T e (P ) > 2L + 2(P 
the theorem. 

One particular distribution of I\ and 7 2 

(14) Ii(p,l)=p + l- 1, 7 2 (p,l) = P — p + l-\-(P + L-l) = 2P + L + /— p— 1 
meets conditions (4-7) and 

(15) T e = 7 2 (1, L) - ^(l, 1) + 1 = 2L + 2(P - 1). 

This proves the second statement of Theorem 3.1. □ 

The distribution (14) of 7j and 7 2 corresponds to the basic PTA (see the previous section). 

There is the following corollary from Theorem 3.1. 

COROLLARY 3.2. If the maximum idle time for P processors performing PA is equal to 2(P — 1) 
then the idle time of the p th processor is equal to 2 (P — p). 

Proof The p th processor starts to render its fraction of the first line p— 1 time units later than the 
first processor, and the p th processor starts to render its fraction of the last line p — 1 time units earlier 
than the first processor. Thus, the following relations hold for the I\ and 7 2 functions corresponding 
to the first and the last lines: 

(16) h(p, 1) = 7 X (1, 1) + p - 1, 7 2 (1, L) = 7 2 (p, L) +p — 1. 

The maximum idle time for the system of P processors is calculated as follows: 

(17) T e (P) = 72(1, L) - h{ 1, 1) = h{p,L) + (p - 1) - (Ji (p,L) - (p - 1)) = T e (p) + 2(p - 1). 


-!)• 

— 1)) proves the first statement of 


] 1 f 


6 



Therefore, 


(18) T e (p) = T e (P) - 2(p - 1). 

According to Theorem 3.1 T e (P ) = 2L + 2(P - 1), the elapsed time of the p th processor is given 
by 

(19) T e (p) = 2L + 2(P - 1) - 2(p - 1) = 2L + 2(P - p). 

The useful time of each processor is equal to 2L\ therefore, the idle time is equal to 2 (P — p). □ 

3.2. Two-step and first-last pipelined algorithms. We define the two-step and first-last 
pipelined algorithms (FL-PA) starting from two outermost processors. Subsets Ai and A 2 include 
lines starting from the first and the P th processors, respectively. There are two more fluxes of data 
than for the one-way PAs defined in the previous subsection. The functions 1 3 and 1 4 are starting 
times of the forward and backward computations, respectively, for fines belonging to A 2 . These 
functions have to satisfy the following relations similar to (5-7): 

(20) h{p,i) > h(p+ 1* *) + 1, P = P~ 1,..., 1; 

(21) AM > MM) + i; 

(22) I 4 (p,i)>I A (p-l y i) + l, p=2, ...,P; 
where i 6 A 2 . 

Conditions (5-7) must be satisfied for i 6 Ai. The condition (4) with r, s 6 {1,2, 3, 4} and A- 
requirement must be satisfied for all functions / r , where r G {1, 2, 3, 4). According to the definition of 
the maximum elapsed time for P processors, this time for FL-PAs is given by 

(23) T e = max(/ 2 (l,Li),/ 4 (P,L 2 )) - mm(Ji(l, f\), J 3 (P, F 2 )) + 1 

where Fi, F 2 , Li, L 2 are the first and last fines in subsets Ai and A 2 , respectively. 

Wc define a passage as a set of starting times of the pipelined forward or backward computations 
on P processors for the i th fine: 

(24) {I r (p,i)\p= 1,...,P}, {/,(p,0|P= 
where r 6 {1,3}, 5 G {2, 4}. 

The FL-PAs drive the |Ai| + |A 2 | = L passages from the first to the last processor governed by 
1 1 and 1 4 functions and the L passages from the last to the first processor governed by / 2 and J3 
functions. 

Now wc will define the following reformulation of the FL-PA. This reformulated algorithm uses 
the same passages as the original FL-PA, i.c., 


(25) 

Vi,n 

r 2 : 

4T(p,i) 

= ^r 2 (p,i); 

(26) 

Vi, Si 

s 2 : 

C(P4) 



where p= n,r 2 G {1,3}, s lt s 2 G {2,4}. 
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The lines start from the first processor in the following order: F U ...,L\ (the forward step) then 
F 2 , ...) L2 (the backward step), and from the last processor, F 2 ,...,L 2 (the forward step), and then 
(the backward step). Here indices 1 and 2 correspond to subsets Ai and A 2j respectively. 
Now all backward computations starting from an outermost processor are performed later than the 
forward computations starting from the same outermost processor: 

(27) Via G Ai Vi 2 € A 2 : I?" fail) < ^"(** 2 ), < i? ew (p,n). 

Obviously, conditions (5,7,20,22) are satisfied for each passage of the FL-PA. The reformulated 
algorithm meets these conditions as it uses the same passages as the original one. Conditions (6,21) 
are satisfied because each processor completes forward step computations for each line no later than 
and begins backward step computations for each line no earlier than in the original algorithm. 

The forward step computations for the first lines (Fa and F 2 ) for the reformulated and original 
FL-PA start at the same time. The backward step computations for the last lines for the reformulated 
and original FL-PA (L\ and L 2 ) are completed at the same times. Thus, the maximum elapsed time 
for this algorithm is equal to that for the original algorithm for the same number of processors. 

Lemma 3.3. The reformulation of the FL-PA described above does not change the maximum 
elapsed time . 

We define the symmetric FL-PA (SFL-PA) as a particular case of the FL-PA so that 

(28) h(p,ii) = h(P-p+l,i 2), h(p,h) = h{P — P + M 2 ), 

where |Ai| = |A 2 j = Lj 2, i\ € Aj, i 2 £ A 2 , p = 1, ..., F. This definition is valid for an even number 
of processors P and an even total number of lines L. 

All first-last pipelined Thomas algorithms described in the previous section fall into the class of 
SFL-PA (sec Fig. lc,d). 

In a more common sense, one may reformulate any FL-PA in such a way that the lines belonging 
to Ai start from the last outermost processor and the lines belonging to A 2 start from the first 
processor. Obviously, this reformulation docs not change the maximum elapsed time. Therefore, T e is 
a symmetric function of its arguments (T e (7i, 7 2 , h, h) = T e (/ 3 , / 4 , /a, / 2 )), and the case of symmetric 
FL-PA I i = 7 3 , / 2 = I 4 corresponds to the local minimum of elapsed time of FL-PAs. 

Theorem 3.4. For the SFL-PA } the maximum elapsed time is greater than or equal to 2L + 2(P— 
2). There exist functions I\ y / 2 , I3 and I 4 meeting conditions 20-22) such thatT e = 2L+2(P — 2) 
for these I functions. 

Proof The method of mathematical induction on the number of processors P with the induction 
step equal to two is used to prove the first part of the theorem. 

For P — 2 and L — 2 the processors can perform computations with zero idle time. The first and 
the second processor start the forward step computations simultaneously for the first and the second 
line. At the end of the first time unit, they exchange the interface data and complete the forward step 
(the first processor renders the second line and vice versa). Processors compute the backward step 
for two lines the same way. The functions J r , r — 1, 2, 3, 4 arc given by: 

7i(l, 1) = 1, Ii(2,l) = 2, 73(2,2) = 1, 7 3 (2,1)=2, 

(29) 7 2 (1, 1) = 4, 72(2,1) = 3, J 3 (l,2)=3, 7 3 (2,2) = 4. 

For P — 2 and L = 2m, m > 1, the lines are coupled and treated as described above couple by 
couple. The elapsed time is T e = 2L = 2L + 2(2 — 2). 
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Suppose that the statement of induction is valid for p < P. Assume that for p = P + 2 the elapsed 
time is T e (P + 2) < 2 L + 2((P + 2 ) - 2 ) for some set of I satisfying conditions (4-7, 20 - 22 ). 

Let’s reformulate this algorithm according to Lemma 3.3. This procedure docs not change the 
elapsed time. Obviously, the reformulated algorithm is also symmetrical. 

Now switch off the two outermost processors. Define the new set J” ew; ,r € { 1 , 2 , 3, 4} on P 
processors 


I? ew (pJx) = Ji(p+ Mi), 

ir w (j>M) = h{p+l>i2), 

(30) ir w ( P,i a) = /4 ( P+ 1 ,*2 ) -2, 

where p = 1,...,P, fi ,2 = 1, 

The SFL-PA characterized by the set 7 nen; starts the backward computations two time units earlier 
than the assumed SFL-PA on P + 2 processors. There is no forward or backward step computations 
on switched-off processors; therefore, the backward step computations on the second and on the 
(P + l) th processors can start immediately after completion the forward computations for the last 
lines ( L\ € Ai , L 2 e A 2 ). 

According to Lemma 3.3, forward computations for ii E Aj are executed earlier than backward 
computations for z 2 € A 2 . According to the definition of symmetrical FL-PA, all forward computations 
arc performed earlier than all backward computations on all processors. Therefore, condition (4) is 
satisfied for 7 corresponding to different, forward and backward, pipelined computations. The set 
jnew mcc t s conditions (4-7, 20-22) as /[^ are equal to and arc equal to 7 2j 4 — 2. 

The elapsed time for this SFL-PA on P processors is derived in the same manner as in the proof 
of Theorem 3.1 (see (11)): 


T e (P) = ir w ( w - ir w (hFi) + 1 = 

7 2 (2, L 0 - Ji(2, Fi) + 1 = (/ 2 (1, L{) - 3) - (^(l, F x ) + 1) = 
(31) T e (P + 2) - 4 < 2L + 2(P - 2 ). 


This contradiction of the induction hypothesis proves the first statement of the theorem. 
One example with the minimum elapsed time is presented here for P = 2 q and L = 2m : 


(32) 


{ p + l — 1 if p < q 

m + q — 1 + (p — q) + 1 — 1 if p> q\ 


1 < l < m 


(33) 


h 


P — p + l — m if p > q 

m + q— \ + q — p + l — m if p < q. 


m +1 <1 <L 


The functions / 2 and I 4 are analogous to 1 3 and 7i, respectively. One half of the lines is rendered by 
the forward step beginning from the first processor and the other half is computed starting from the 
last processor. After completion of the first half of lines by the processors l,...,g and simultaneous 
completion of the second half by the processors P, P — 1 , q + 1 , processors q and q - hi exchange 
the interface data and then continue to render lines by the forward step. The first q processors now 
render the second half of lines and vice versa. After a simultaneous completion of the forward step, 
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processors compute the backward step the same way. For both groups of processors, it takes m 4- q — 1 
units of time to render m lines either by backward or forward step. 

These subsets of q processors are referred to as single super-processors, and subsets Ai and A 2 of 
m lines arc referred to as single super- lines. Thus, there are two super-processors and two super- lines, 
as in the basis of induction, which has been considered in the proof of this Theorem. Thus, the elapsed 
time is equal to 4 (m + q — 1) = 2L + 2 (P — 2). □ 

Therefore, SFL-PAs have only a small advantage in terms of the maximum idle processor time 
over the basic PTA. 

3.3. Applications to the pipelined Thomas algorithms. All pipelined Thomas algorithms 
described in the second section belong to PAs. Explicit derivation of the I functions can be a cumber- 
some task for some versions of PTA. It is difficult to do this for the First-Last Immediate Backward 
pipelined Thomas Algorithm (Fig. Id) since it treats four fluxes of data simultaneously. Instead, one 
can use Theorems 3.1 and 3.4 to estimate the elapsed time and the processor idle time. 

The first two algorithms, the basic PTA (Fig. la) and the Immediate Backward pipelined Thomas 
Algorithm (IB-PTA) (Fig. lb) have equal minimum elapsed times (Theorem 3.1). The first-last 
algorithms (Fig. lc,d) have a small advantage of two time units idle time for an even number of 
lines (Theorem 3.4). Thus, these pipelined Thomas algorithms do not reduce the processor idle time 
considerably. However, for the IB-PTA and for the FL-IB-PTA there are completed lines by the time 
processors become idle. These processors can be used for other computational tasks while they are 
idle from the Thomas algorithm. 

For real MIMD computers, there is a latency time of communications between processors. There- 
fore, it is advantageous to solve a number of lines per single message, i.c., to transfer coefficients (for 
the forward step) or the solution (for the backward step) for a number of lines. 

The computational times per node may not be equal for forward and backward steps. For example, 
there arc three multiplication operations per grid point for the forward step and two multiplication 
operations per grid point for the backward step of the tridiagonal Thomas algorithm. The number of 
portions of lines may be different for forward and backward steps of the pipelined Thomas algorithm. 
Those lines that have been completed by the forward step are gathered in groups of lines on the last 
processor (see the next section for details). Sets A j and A& include portions of lines (not single lines) 
for the forward and backward computations, respectively. The same number of lines is treated by 
forward and backward steps: 

(34) |A S \K X = \A„\K 2 , 

where K\ and K 2 are numbers of lines solved per message for forward and backward steps, respectively. 

To avoid idle time, the amount of computational work should be equal for the forward and 
backward steps of the Immediate Backward PTAs. Thus, numbers of lines solved per one message for 
forward and backward steps must satisfy the following equation: 

(35) NK igi =NK 2 g 2 , 

where £ 1 , <72 are computational times per grid node for forward and backward steps. Therefore, the 
computational time per fraction of a portion of lines assigned to a single processor is taken to be 
equal to unity for forward and backward step computations. As in the previous cases of the line- 
by-line treatment, processors switch between forward and backward steps of the IB-PTA. However, 
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processors may treat several portions of lines by the forward step computations prior to switching to 
the backward step computations (compare Fig. 2a and b). 

To analyze these versions of the pipelined Thomas algorithms which render lines portion-by- 
portion, we define the extended pipelined algorithms (EPA) that arc analogous to the PAs defined in 
subsection 3.1. The I functions must satisfy relations analogous to (4-7), where i>j are the numbers 
of portions of lines. The condition (6) is given by 

(36) I 1 (P,i)<I 2 (PJ) i 

where iK\ < jKi. This condition states that all lines belonging to the i th portion has been completed 
by the forward step computations prior to the start of the backward step computations for any of 
these lines. 

The first-last portion-by- portion algorithms are analogous to FL-PTAs (subsection 3.2). The I 
functions should satisfy conditions analogous to (4-7) and (20-22). We define a symmetric FL-EPA, 
denoted as SFL-EPA, as follows: 

(37) = J 3 (P - p + 1, * 2 /), h(p,i\b) = h{P-p + M 2 b), 

where |Ai/| = |A 2 /| = Lfj 2 , | Ai^l = | A 2 fo| = Lb/ 2 , iij e Ai /, z 2 / € A 2 /, iib £ Ap,, 12 b £ A 2 &, p = 
1 

The proofs of Theorems 3.1 and 3.4 can be used straightforwardly to prove the following Theorem. 
Theorem 3.5. For the EPA , the maximum elapsed time of the system of processors is greater 
than or equal to 


T e = L f + Lb + 2(P — 1 ). 

For the SFL-EPA , the maximum elapsed time is greater than or equal to 

T e = L f Lb + 2(P — 2 ). 


The various formulations of the PTA considered in the previous section can be formulated in this 
case where portions of lines are treated per single interface message. If the same number of lines, say, 
K\ for the forward step and AT 2 for the backward step, are solved per message for the basic EPA and 
the IB- EPA, the minimum of processor idle times is equal for these algorithms. 

4. Generation of processor schedule. At each time unit each processor either performs for- 
ward or backward step computations or is idle. For the basic PTA the processor scheduling is simple: 
each processor executes the forward step, becomes idle and then executes the backward step. Each pro- 
cessor starts computations immediately after the corresponding data are available from its neighbor. 
In this case, communications govern computational tasks and there is no necessity of the processor 
scheduling before computations. 

For other considered versions of the Thomas pipelined algorithms, there arc more simultaneous 
fluxes of data in a system of processors and the order of processor tasks is more complicated. For 
example, completed forward step coefficients from the previous processor are not transfered to the 
current processor immediately after they have been computed, as this processor may execute the 
backward step computations at the next time unit. Our earlier attempts to govern processors by 
communications have led to cumbersome code patterns. The static processor schedule should be 


11 



computed prior to running the PTA. This schedule governs the consequence of computations and 
communications on each processor. 

To set up this schedule, let us define the variable J(p , i) that governs the sequence of the forward 
and backward computations and the idle state: 


(38) 




+1 

0 

-1 


forward step computations 
processor is idle 
backward step computations, 


where p is the number of processors and i is the serial number of the time unit. The first time unit 
on each processor corresponds to computation of the first portion of lines by the forward step on this 
processor. Therefore, the 7(p, i) and 7(p + 1, i) correspond to two following time units. Use of this 
variable is more convenient for computer programming of the pipelined Thomas algorithms rather 
than the explicit use of the I(p, l) functions. 

We confine ourselves to one-way PTAs in this section. 

Theorem 4.1. For any given pipelined Thomas algorithm on P processors with the minimum 
elapsed time there exists the PTA on P+1 processors with the same processor schedule for all processors 
except the first ( outermost ) processor such that the elapsed time of this PTA is minimum . 

Proof Consider the pipelined Thomas algorithm with minimum elapsed time ( 2 L + 2 (P - 1 )) on 
P processors. We construct here a schedule for the additional processor meeting conditions of this 
theorem. 

Renumber processors l,...,Pto 2 ,...,P+l. First, we formulate the schedule for the first (new) 
processor as follows: 


J(M) = 1 if 7(2, i) = 1 
J(l,* + 2 ) = -l if 7(2, i) = — 1 
(39) 7(1, i) —0 otherwise, 

where index i runs from 1 to 2 L + 2(P — 1 ) on the second processor. 

There arc two more elapsed time units on the first processor than on the second one, therefore, 
this algorithm has the minimum elapsed time, 2 L + 2(P - 1) + 2, on (P + 1 ) processors. 

Now we have to check that the first processor computes no more than one portion of lines at 
each time unit (condition (4) in terms of I functions). This means that the assignment ( 39 ) does not 
assign 1 and -1 to the same i th time unit on the first processor. This situation may occur only as 
a result of a switch from the backward step to the forward step on the second processor such that 
J( 2 , i c — 2) = -1 and 7(2, i c ) = 1 . Note that the switch from the backward step to the forward step, 
such that 7(2, z c — 2 ) = — 1 , 7(2, i c — 1 ) = 1 and 7(2, i c ) = — 1 , docs not lead to this collision. - 
Consider the first switch leading to the collision. The assignment (39) maps one-to-one the set 
{1 < i < i c — 1 1*7(2, i) 7 ^ 0 } into the set {1 < i < i c + 1 [*7(1 , z) ^ 0}. There exists at least two 
7(l,z),z = 1 , + 1 that are equal to zero. As 7(1, i c ) = 1 there exists at least one 1 < i < i c — 1 

such that 7(1, i) = 0. Consequently, there is an idle time unit on the first processor which may be 
used for the forward step computations preceding those at the i l c h time unit on the second processor. 

After every switch from the backward to the forward step, there is a switch from the forward to 
the backward step, as the backward step computations always perform last. Consider a switch from 
the forward to the backward step, i.e., 7(2, z e ) = 1 and 7(2, i e + 1 ) = — 1 . The value 7(1, i e + 1 ) is 


1 I! 
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equal to zero unless J(2,i e — 1) = — 1 (sec (39)). In this case there is no collision due to switching 
from the backward to the forward step (see above). After each collision, an additional vacant time 
unit appears due to the forward-to-backward switch, and it might be used to resolve the next collision. 
Therefore, we adopt the following single- valued assignment: 

J(himin) =1 ^ J(2,t) = l 

J(l,i + 2) = -1 if J(2,i) = -1 

(40) J(l,t) =0 otherwise, 

where im in = min(l < j < i\ J(l, j) = 0). 

The obtained schedule meets the pipelined nature of PTAs (5,7) as for any portion of lines the 
forward computations arc performed earlier on the first processor than on the second one, and the 
backward computations are performed earlier on the second processor than in the first one. Obviously, 
the inequality (6) is satisfied, as there arc no changes of the schedule of the last outermost processor. 
This proves the theorem. □ 

Let us define a variable that governs communications between processors: 

r 

0 processors p and p + 1 do not communicate 

, _ . _ , .. 1 send forward-step coefficients from p to p + 1 

(41) C7om(p, i) — 

2 receive solution from p + 1 to p 

3 send forward- step coefficients and receive solution, 

where p refers to communications between the p th and the (p+ 1)^ computers, i is the end of the i th 
time unit (or beginning of the ( i -j- l) th time unit) on the p th processor. 

If J(p, i) is given, the variable Com{p, z) is computed by 

1 if J(p + 1, i) = 1 

2 if J(p + 1, i — 1) = — 1 and J{p + 1, i) / 1 

3 if J(p + 1, i — 1) = — 1 and J(p 4- 1, i) = 1 
0 otherwise. 

The end of the i th time unit on the p th processor corresponds to the beginning of the i th time 
unit on the ( p + l) th processor. If J(p + l,i) = 1, the (p + l) th processor receives the forward step 
coefficients. The first line in (42) defines the transfer of the forward step coefficients from the p th to 
the {p -h l) th processor. 

The (p-j-l) <h processor sends the backward step solution to the p th processor immediately after the 
completion of their computations. The second line of (42) defines the solution transfer for backward 
pipelined computations. 

The third line of (42) corresponds to the case of simultaneous transfer of the forward step co- 
efficients from the p th to the (p -f l)* 1 2 3 * * * 7 * processor and the backward step solution from the (p + l) th 
to the p th processor. If the send and receive operations are performed simultaneously, parallelism of 
communications appears as an additional advantage of the IB-PTA. 

One recursive algorithm that assigns the schedule of computations to the p th processor and the 
schedule of communications between the p th and the (p - f l) th processors for a given schedule of the 
{p + l) th processor, is shown in Fig. 4. First, this algorithm assigns zero values to J and Com 
variables corresponding to the p th processor. For negative J(p + 1, z) the values J(p, i -j- 2) = -1 and 
Com(p, i + 1) = 2 are assigned (sec (40) and (42)). 
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For positive J(p + l,t), the value of Com{p ) I) is equal cither to 1 or to 3. The latter case is 
realized if J(p+ 1, i — 1) = — 1 (see (42)), and the previous Com{p ) i) = 2 is reassigned. Then the 
search for a vacant time unit i m i n (see (40)) is realized and the assignment J(p, z mm ) — 1 is completed. 
The entire loop is repeated 2(P - p) -f- Lj + times according to Theorem 3.5. 

Theorem 4.1 states that the algorithm is correct. The only reason that the algorithm is unable 
to find a vacant time unit for the forward step computations (error warning) is an erroneous schedule 
(J(Pji)) on the last outermost processor (see below). 

Thus, a valid schedule should be assigned to the P th outermost processor before recursive com- 
putations of the time schedule for other processors are executed. This schedule must satisfy condition 
(4). To keep the minimum idle time for the system of P processors, the P ih processor should execute 
forward and backward step computations in a contiguous way. If the outermost processor is idle some 
time, we cannot obtain a schedule corresponding to the minimum elapsed time on other processors. 

The assignment of J(P, i) for the basic PTA is simple: 

(43) J(P, z) = 1 if i=l } ...,Ly 

J(P,*) = - 1 if * = £/ + !,..., L/ + L*. 

The algorithm for assignment of J(P, i) for the IB-PTA is shown in Fig. 5. First, the forward step 
computations are performed for the first portion of lines. The variable KS is equal to the number of 
available lines which have been completed by the forward step and not yet rendered by the backward 
step. 

Scheduling the IB-PTA for the P th processor, the backward step computations begin for the next 
portion of K<i lines once these lines have been completed by the forward step [KS > Af 2 )- Otherwise, 
one more portion of lines must be completed by the forward step computations. 

For example, consider the tri-diagonal Thomas algorithm with K^jK\ ~ g\/g 2 = 1.5. 1 Once 
two first portions of K\ lines have been completed by the forward step, 1.5 K\ lines from these two 
portions form the first portion for the backward step computations. The remaining 0.5K\ lines and 
the third portion of lines form the second portion of K 2 fines for the backward step computations. 
This loop repeats until the completion of fines. 

The processor schedules for the IB-PTA and the basic PTA arc shown in Fig. 6a, b respectively. 
The p th column of values of J corresponds to the p ih processor (from 1 to P). Columns are shifted so 

as each horizontal fine corresponds to a single time moment in terms of wall clock. Arrows >, 

< and < > denote the send, receive and send- receive communications that correspond to 

the values of the variable Com (sec (42)). One can find that the maximum elapsed time (length of the 
first column) is the same for the PTA and the IB-PTA. Of course, the idle time (the number of zeros) 
is also the same. The advantage of the IB-PTA is that the processors become idle after completion of 
some fines, therefore, the processors may be used for other computational tasks using these data. 

5. Conclusions. The ways of efficient parallelization of two-step (forward-backward) pipelined 
algorithms (PAs) originated from the solution of banded matrix linear systems on parallel computers 
are discussed. The recursive algorithm for the assignment of the processor computation and commu- 
nication schedule for PAs is derived. This scheduling algorithm can be used to execute multiblock 
computations where different fines come to a processor from different neighbors, to reduce idle time 

1 the exact ratio is compiler- and computer-dependent 


by scheduling other computational tasks while processors are idle from PA computations and to im- 
plement those versions of PAs which can improve the parallelization efficiency. 

It is shown theoretically that the processor idle time of the pipelined Thomas algorithms cannot 
be improved directly. 

To remedy the problem of the processor idle time, the new parallel version of the Thomas algorithm 
(IB-PTA) is proposed and denoted as the Immediate Backward Pipelined Thomas Algorithm (IB- 
PTA). The backward step is processed immediately after the forward step has been completed for the 
first portion of lines. Thus, part of the lines have been computed by the Thomas algorithm before 
processors become idle, and the idle processors can perform other computational data-dependent 
tasks. The obtained processor schedule for the proposed IB-PTA and basic PTA are presented. These 
schedules confirm our theoretical results about PAs. 
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Fig. 1 . Examples of pipelined Thomas algorithms (a) basic Pipelined Thomas Algorithm (PTA); (b) Immediate 
Backward Pipelined Thomas Algorithm (IB-PTA); (c) First-Last Pipelined Thomas Algorithm (FL-PTA); (d) First- 

Last Immediate Backward Pipelined Thomas Algorithm (FL-IB-PTA), where > denotes the forward step, — » 

denotes the backward step 


I II 


16 



a 


1+1 


1+2 1+3 1+4 1+5 
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Fig. 2. Sequence of forward and backward steps for the IB-PTA at two sequential time units, where I,. ..,1+5 
are the numbers of processors, F is the forward step, B is the backward step, arrows denote data transfer between 
processors ; (a) equal computational times for the F and B steps (g± — < 72 ); (b ) non-equal computational times for the 
F and B steps, corresponding to the tri-diagonal Thomas algorithm {g\ — 1 . 592 ) 
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Fig. 4. Assignment of processor schedule on the (p + l)* h processor based on a given schedule on the p th processor 
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Fig. 5. Assignment of the 
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